Policy Iteration
Dual Policy Iteration
Recently, a novel class of Approximate Policy Iteration (API) algorithms has demonstrated impressive practical performance (e.g., ExIt from [1], AlphaGo-Zero from [2]). This new family of algorithms maintains, and alternately optimizes, two policies: a fast, reactive policy (e.g., a deep neural network) deployed at test time, and a slow, non-reactive policy (e.g., tree search) that can plan multiple steps ahead. The reactive policy is updated under supervision from the non-reactive policy, while the non-reactive policy is improved with guidance from the reactive policy. In this work we study this Dual Policy Iteration (DPI) strategy in an alternating optimization framework and provide a convergence analysis that extends existing API theory. We also develop a special instance of this framework that reduces the update of the non-reactive policy to model-based optimal control using learned local models, providing a theoretically sound way of unifying model-free and model-based RL approaches with unknown dynamics. We demonstrate the efficacy of our approach on various continuous-control Markov Decision Processes.
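To make the alternating loop concrete, here is a minimal sketch of the DPI pattern the abstract describes: collect data with the fast policy, fit a local dynamics model, plan through it with a slow non-reactive policy guided by the fast one, then imitate the planner. Everything below (the toy linear system, `rollout`, `fit_local_model`, `shooting_plan`) is an illustrative stand-in under simplifying assumptions, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 2  # state and action dimension of the toy system

# Hypothetical test bed: linear dynamics x' = A x + B u, quadratic cost.
A = np.eye(DIM) + 0.1 * rng.standard_normal((DIM, DIM))
B = 0.1 * rng.standard_normal((DIM, DIM))

def env_step(x, u):
    x_next = A @ x + B @ u
    return x_next, x_next @ x_next + 0.1 * u @ u

def rollout(policy, x0, horizon):
    """Gather (state, action, next state) data with the fast policy."""
    xs, us, xns = [], [], []
    x = x0
    for _ in range(horizon):
        u = policy(x) + 0.05 * rng.standard_normal(DIM)  # exploration noise
        xn, _ = env_step(x, u)
        xs.append(x); us.append(u); xns.append(xn)
        x = xn
    return np.array(xs), np.array(us), np.array(xns)

def fit_local_model(xs, us, xns):
    """Least-squares local dynamics model: x' ~= [x, u] @ M."""
    Z = np.hstack([xs, us])
    return np.linalg.lstsq(Z, xns, rcond=None)[0]

def shooting_plan(M, x, policy, horizon=5, n_samples=256):
    """Slow, non-reactive policy: random-shooting MPC through the learned
    model, with candidate actions centered on the fast policy (its guidance)."""
    best_u, best_cost = None, np.inf
    for _ in range(n_samples):
        xm, total, first_u = x, 0.0, None
        for t in range(horizon):
            u = policy(xm) + 0.5 * rng.standard_normal(DIM)
            first_u = u if t == 0 else first_u
            xm = np.concatenate([xm, u]) @ M
            total += xm @ xm + 0.1 * u @ u
        if total < best_cost:
            best_cost, best_u = total, first_u
    return best_u

W = np.zeros((DIM, DIM))  # fast reactive policy: u = W x
for it in range(20):
    pi = lambda x: W @ x
    xs, us, xns = rollout(pi, rng.standard_normal(DIM), horizon=30)
    M = fit_local_model(xs, us, xns)               # learned local model
    expert = np.array([shooting_plan(M, x, pi) for x in xs])
    # Supervised step: refit the fast policy to the planner's actions.
    W = np.linalg.lstsq(xs, expert, rcond=None)[0].T
```

Each iteration performs both halves of the alternation: the planner improves on the reactive policy using the learned model, and the reactive policy is refit, by supervised regression, to the planner's actions.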
Appendix: Performance Bounds for Policy-Based Average Reward Reinforcement Learning Algorithms
Thus the optimal average rewards of the original MDP and the modified MDP differ by O(ε). To ensure that Assumption 3.1(b) is satisfied, an aperiodicity transformation can be implemented; the proof of this theorem can be found in [Sch71]. From Lemma 2.2, we thus obtain the corresponding bound on the average reward J. In order to iterate Equation (8), we need to ensure that the terms are non-negative. Theorem 3.3 presents an upper bound on the error in terms of the average reward.
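For reference, one standard form of the aperiodicity transformation (the construction associated with [Sch71]) mixes the transition kernel with a self-loop; the exact constants used in this appendix may differ.

```latex
% Mix the kernel with a self-loop of weight 1 - \tau, rewards unchanged.
\[
  \tilde{P}(s' \mid s, a) = (1 - \tau)\,\mathbb{1}\{s' = s\}
                          + \tau\, P(s' \mid s, a),
  \qquad \tau \in (0, 1), \qquad \tilde{r} = r.
\]
% Any stationary distribution \mu of the chain induced by a policy
% satisfies \mu \tilde{P} = (1 - \tau)\mu + \tau\,\mu P = \mu, so the
% average reward is preserved, while the self-loops make every state
% aperiodic, as Assumption 3.1(b) requires.
```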
Learning to Constrain Policy Optimization with Virtual Trust Region
Compared to Deep Q-learning, deep policy gradient (PG) methods are often more flexible and applicable to both discrete and continuous action problems. However, these methods tend to suffer from high sample complexity and training instability, since the gradient may not accurately reflect the policy gain when the policy changes substantially [6].
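The instability described here is what trust-region constraints address: keep each update close (in KL divergence) to the current policy, so the local gradient remains a faithful proxy for the policy gain. Below is a minimal, generic TRPO/PPO-style sketch for a categorical policy; it illustrates the idea only and is not the paper's "virtual trust region" mechanism (all names are hypothetical).

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def kl_categorical(p, q, eps=1e-12):
    """KL(p || q) per state for categorical distributions."""
    return np.sum(p * (np.log(p + eps) - np.log(q + eps)), axis=-1)

def constrained_update(logits, actions, advantages, lr=0.1, max_kl=0.01):
    """One policy-gradient ascent step, backtracked until the mean KL to
    the pre-update policy stays inside the trust region."""
    old_probs = softmax(logits)
    grad = np.zeros_like(logits)
    for i, a in enumerate(actions):
        g = -old_probs[i]
        g[a] += 1.0                 # grad of log pi(a|s) w.r.t. logits
        grad[i] = advantages[i] * g
    step = lr * grad
    while kl_categorical(old_probs, softmax(logits + step)).mean() > max_kl:
        step *= 0.5                 # shrink until inside the KL ball
    return logits + step

# Toy usage: 4 states, 3 actions.
logits = np.zeros((4, 3))
new_logits = constrained_update(logits, actions=[0, 2, 1, 0],
                                advantages=np.array([1.0, -0.5, 0.3, 2.0]))
```

The backtracking loop terminates because the step shrinks toward zero, where the KL to the pre-update policy vanishes; TRPO proper solves a constrained second-order problem, but the halving line search conveys the same constraint.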